=================================
Model Compression Techniques
=================================


Pruning Techniques
###################


.. list-table:: Method Comparison
   :widths: 25 15 15 25 15 15 25 25
   :header-rows: 1

   * - Method
     - Weight Update
     - Calibration Data
     - Pruning Metric
     - Complexity
     - Fine-Tuning
     - Support for LLM/Attention/Linear
     - Support for Convolutional Layer
   * - Granular-Magnitude
     - ✗
     - **✗**
     - :math:`|W_{ij}|`
     - O(1)
     - ✓
     - ✓
     - ✓
   * - Channel-Wise Magnitude
     - ✗
     - ✓
     - :math:`|W_{j}|`
     - O(1)
     - ✓
     - ✓
     - ✓
   * - Optimal Brain Compression
     - ✗
     - ✓
     - :math:`|W|^2 / \mathrm{diag}\left((XX^T + \lambda I)^{-1}\right)`
     - :math:`O(d_{\mathrm{hidden}}^3)`
     - ✗
     - ✓
     - ✓
   * - SparseGPT
     - ✓
     - ✓
     - :math:`|W|^2 / \mathrm{diag}\left((XX^T + \lambda I)^{-1}\right)`
     - :math:`O(d_{\mathrm{hidden}}^3)`
     - ✗
     - ✓
     - ✗
   * - Wanda
     - ✗
     - ✓
     - :math:`|W_{ij}| \cdot \|X_{j}\|_{2}`
     - :math:`O(d_{\mathrm{hidden}}^2)`
     - ✗
     - ✓
     - ✗
   * - **Venum**
     - **✗**
     - ✓
     - :math:`|W_{ij}| \cdot \|X_{j}\|_{2}`
     - :math:`O(d_{\mathrm{hidden}}^2)`
     - Minimal (optional)
     - ✓
     - ✓
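
To make the "Pruning Metric" column concrete, the following is a minimal NumPy sketch of two of the scores in the table: the granular magnitude score :math:`|W_{ij}|` and the activation-aware score :math:`|W_{ij}| \cdot \|X_{j}\|_{2}`. The function names (``magnitude_scores``, ``wanda_scores``, ``prune_per_row``) are hypothetical and are not part of this package's API. Pruning per output row for both metrics is an assumption made here so the two results are directly comparable.

.. code-block:: python

   import numpy as np


   def magnitude_scores(weight):
       """Granular magnitude metric: |W_ij| for every weight."""
       return np.abs(weight)


   def wanda_scores(weight, activations):
       """Activation-aware metric: |W_ij| * ||X_j||_2.

       weight:      (out_features, in_features)
       activations: (num_tokens, in_features) calibration inputs
       """
       # L2 norm of each input feature across all calibration tokens.
       feature_norms = np.linalg.norm(activations, axis=0)
       # Broadcasts the per-feature norm over every output row.
       return np.abs(weight) * feature_norms


   def prune_per_row(weight, scores, sparsity):
       """Zero the lowest-scoring fraction of weights in each output row."""
       num_pruned = int(weight.shape[1] * sparsity)
       # Column indices of the `num_pruned` smallest scores in every row.
       pruned_idx = np.argsort(scores, axis=1)[:, :num_pruned]
       mask = np.ones_like(weight, dtype=bool)
       np.put_along_axis(mask, pruned_idx, False, axis=1)
       return weight * mask


   rng = np.random.default_rng(0)
   W = rng.normal(size=(4, 8))
   X = rng.normal(size=(32, 8))
   X[:, 0] *= 10.0  # one input feature with unusually large activations

   W_magnitude = prune_per_row(W, magnitude_scores(W), sparsity=0.5)
   W_wanda = prune_per_row(W, wanda_scores(W, X), sparsity=0.5)

Neither function updates the surviving weights, which matches the ✗ entries in the "Weight Update" column for these metrics. The only difference between the two is whether calibration activations ``X`` enter the score.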


Quantization Techniques
########################

Currently, the package supports only Eager Mode Quantization. I look forward to integrating FX Graph Mode Quantization in the near future.

There are three types of quantization supported:

1. Dynamic Quantization:

   - Weights are quantized ahead of time.
   - Activations are read and stored in floating point and are quantized on the fly for compute.

2. Static Quantization:

   - Weights are quantized.
   - Activations are quantized.
   - Calibration is required post-training.

3. Static Quantization Aware Training:

   - Weights are quantized.
   - Activations are quantized.
   - Quantization numerics are modeled during training.
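
The difference between dynamic and static quantization comes down to when the activation scale is chosen. The sketch below illustrates this with plain NumPy and symmetric per-tensor int8 quantization. It is a simplified model of the numerics under those assumptions, not this package's API and not how any particular backend implements its kernels. All helper names (``scale_for``, ``quantize``, ``quantized_linear``) are hypothetical.

.. code-block:: python

   import numpy as np


   def scale_for(tensor, num_bits=8):
       """Symmetric per-tensor scale: the largest |value| maps to the range edge."""
       qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
       max_abs = float(np.max(np.abs(tensor)))
       return max_abs / qmax if max_abs > 0 else 1.0


   def quantize(tensor, scale, num_bits=8):
       """Round to the nearest integer step and clip to the signed range."""
       qmax = 2 ** (num_bits - 1) - 1
       return np.clip(np.round(tensor / scale), -qmax, qmax).astype(np.int32)


   def quantized_linear(x, w_q, w_scale, x_scale):
       """Quantize the input, do an integer matmul, rescale to floating point."""
       x_q = quantize(x, x_scale)
       return (x_q @ w_q.T) * (x_scale * w_scale)


   rng = np.random.default_rng(0)
   W = rng.normal(size=(16, 32))
   calibration = rng.normal(size=(256, 32))
   x = rng.normal(size=(4, 32))

   # Weights are quantized ahead of time in both modes.
   w_scale = scale_for(W)
   W_q = quantize(W, w_scale)

   # Dynamic: the activation scale is computed from the live input at run time.
   y_dynamic = quantized_linear(x, W_q, w_scale, x_scale=scale_for(x))

   # Static: the activation scale is fixed beforehand from calibration data.
   y_static = quantized_linear(x, W_q, w_scale, x_scale=scale_for(calibration))

   y_float = x @ W.T  # floating-point reference

Quantization aware training uses the same round-and-clip step, but applies it as a quantize-then-dequantize round trip in the forward pass during training, so the model can adapt to the rounding error before it is converted.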